Goals

The goal of preprocessing is to remove possible duplicates and to generate weak labels from the data itself.

I will proceed from the hypothesis that a logo is something popular, so it can appear in the data more than once. Of course, new logos may be hard to detect with this approach, but as the very first step I should check it. This task is strongly connected to duplicate elimination - the main preprocessing step before training models.

Duplicates with different names in the dataset could lead to data leakage, ambiguous labels, and overfitting in edge cases.

I should also try to generate weak labels from the data to train a classifier. I don't think this problem can be fully solved with unsupervised methods, because discriminating between real and fake logos is a subjective process and requires at least approximate labels. The most promising approach here is semi-supervised learning (SSL), where we have a lot of unlabeled data and a small percentage of real or approximate labels. These conditions allow us to use typical deep learning classification models with a modified semi-supervised pipeline.

Preprocessing steps

1) Find duplicates using perceptual hash similarity
2) Filter duplicates and compute the number of occurrences
3) Use number of occurrences > 1 as the condition for a "true" (positive) weak label
4) Use the Google API Logo Detection to label 1000 images
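Steps 1-3 above can be sketched as follows. This is a minimal illustration, assuming a toy average-hash in place of the real perceptual hash (in practice, a library such as `imagehash` would compute the hash from full-size images); the function and variable names are my own.

```python
import numpy as np

def average_hash(img: np.ndarray) -> np.ndarray:
    """Toy average-hash: threshold an 8x8 grayscale patch at its mean.
    Stands in for a real perceptual hash computed on resized images."""
    assert img.shape == (8, 8)
    return (img > img.mean()).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Element-wise distance between two boolean descriptors."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def group_duplicates(hashes, threshold=4):
    """Greedy grouping: an image joins the first group whose
    representative hash is within `threshold`, else starts a new group."""
    groups = []  # list of (representative_hash, [image indices])
    for i, h in enumerate(hashes):
        for rep, members in groups:
            if hamming(rep, h) <= threshold:
                members.append(i)
                break
        else:
            groups.append((h, [i]))
    return groups

def weak_labels(groups, n):
    """Images whose group occurs more than once get weak label 1."""
    labels = np.zeros(n, dtype=int)
    for _, members in groups:
        if len(members) > 1:
            labels[members] = 1
    return labels

# Example: two identical images plus one unrelated image
rng = np.random.default_rng(0)
base = rng.random((8, 8))
images = [base, base.copy(), rng.random((8, 8))]
hashes = [average_hash(im) for im in images]
groups = group_duplicates(hashes, threshold=0)
print(weak_labels(groups, len(images)))  # the duplicated pair gets label 1
```

The greedy grouping is quadratic in the worst case; for a large dataset, bucketing identical hashes first (e.g. in a dict) keeps it fast.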

Metric discussion

The resulting model should produce a low number of False Positives, which means Precision = $\frac{TP}{TP+FP} \rightarrow \max$. However, we don't have any true labels, only the weak labels we can extract from the data. So high precision over weak labels doesn't guarantee that the model gives correct results.

That is why I labeled 2014 images from the dataset and will use them for validation of the semi-supervised pipeline (they are not used for training).

I didn't use any other logo datasets, because this one contains many misdetections as well as a large variety of logos. Typical datasets like WebLogo-2M cover only 194 logo classes, which seems very little compared to this dataset (judging from my visualization of the data). So my decision to mark part of the data for validation simplifies the final testing of the different methods.

Part 1

Suppose a model returns a score of $1.0$ for every image: $f(x) = 1$. By visualizing random images, we can estimate the approximate precision of such a model and use it as a benchmark for the future classifier. I counted the false positives myself over a few runs of image displaying and call the result visual precision:

$vPrecision = 73.5\%$
However, this is a very optimistic estimate, because the precision computed from the labeled part gives:
$Precision = 37.93\%$
So I will use $37.93\%$ as the baseline to improve upon.
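For the constant model $f(x) = 1$, precision reduces to the fraction of positives in the labeled data. A minimal check (the label vector below is made up for illustration):

```python
import numpy as np

def precision(y_true, y_pred):
    """Precision = TP / (TP + FP)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    return tp / (tp + fp)

# Constant model f(x) = 1: every image is predicted as a logo,
# so precision equals the positive rate of the labeled set.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])  # toy labels, 3 positives out of 8
y_pred = np.ones_like(y_true)
print(precision(y_true, y_pred))  # 0.375
```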

How can we quickly compute similarity between images across the entire dataset? I decided to use a perceptual hash for this task. It's very fast, doesn't require any pretrained model, and can find approximate duplicates.

A perceptual hash turns an input image into an 8x8 boolean descriptor. If two images are similar, their boolean descriptors will be similar and hence have a low or zero element-wise distance - so we choose a threshold $T$ and collect pairs of images with distance $d(A,\ B) \le T$:
$A,\ B \in {\{0,\ 1\}}^{8\times8}$ with distance $d(A,\ B)=\sum_{ij} |a_{ij} - b_{ij}| \in \{0,...,64\}$
Via the linear transformation $2x - 1$ applied to the matrices, elements are mapped from $\{0,\ 1\}$ to $\{-1,\ 1\}$, and we can construct a functional in Einstein summation notation to compute similarities. In this case, similar images should have a similarity score above the corresponding threshold on the similarity scale.

$A,\ B \in {\{-1,\ 1\}}^{8\times8}$ with similarity $s(A,\ B)=\sum_{ij} a_{ij}b_{ij} \in \{-64, ..., 64\}$
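The two formulas are linked by the identity $s = 64 - 2d$, and `np.einsum` computes all pairwise similarities in one call. A small sketch on random hashes (the array shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.integers(0, 2, size=(5, 8, 8))  # 5 boolean 8x8 hash descriptors
S = 2 * H - 1                           # map {0, 1} -> {-1, 1}

# All pairwise similarities s(A, B) = sum_ij a_ij * b_ij via Einstein summation
sim = np.einsum('aij,bij->ab', S, S)

# Consistency with the element-wise distance d: s = 64 - 2d,
# since agreements contribute +1 and disagreements -1
d = np.abs(H[:, None] - H[None, :]).sum(axis=(2, 3))
assert np.array_equal(sim, 64 - 2 * d)

print(sim[0, 0])  # 64: every hash is maximally similar to itself
```

With the identity above, a distance threshold $d \le T$ translates directly into a similarity threshold $s \ge 64 - 2T$.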

This looks quite noisy, but it is good enough for a first step.

Google API

I tried the Python Google API Logo Detection to collect additional, more robust labels. Its free tier allows 1000 requests per month, which I used.

By visually inspecting random samples from the 1000 labeled images, I estimated the approximate precision of the Google API: $Precision = 39.6\%$. It's very low, so I decided not to use its positive predictions. However, I observed that when the API predicts no logo, the result looks reliable enough. So I decided to use its negative predictions as weak labels for the negative class.
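The filtering logic can be sketched as below. The response format here is a simplified stand-in for the annotation lists returned by the API (the real Cloud Vision response carries richer objects); all names are illustrative.

```python
def negative_weak_labels(api_responses):
    """Keep only images where the API detected no logo as weak negatives.

    `api_responses` maps image id -> list of detected logo annotations
    (a simplified stand-in for the API's logo annotation list).
    Positive detections are discarded as unreliable (precision ~ 39.6%).
    """
    return {img_id for img_id, annotations in api_responses.items()
            if not annotations}

responses = {
    "img_001.jpg": [{"description": "SomeBrand", "score": 0.81}],  # positive -> ignored
    "img_002.jpg": [],                                             # no logo -> weak negative
    "img_003.jpg": [],
}
print(negative_weak_labels(responses))  # weak negatives: img_002, img_003
```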

So, as a result, I have a dataset with weak labels for both supervised and semi-supervised settings.

Part 2

I decided to boost the labels based on perceptual hash similarity and the Google API score with additional information: image entropy and the text possibly present on the image. During my exploration of the dataset, I noticed the most popular misdetections:

1) Text over a single-colored background - not a logo; it contains text and probably has low entropy
2) Random crops of cars, real-world scenes, etc. - likely high entropy
3) Text extracted from images can give additional information
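The entropy signal behind points 1 and 2 can be computed from the grayscale intensity histogram. A minimal sketch (the synthetic images below are stand-ins for real crops):

```python
import numpy as np

def shannon_entropy(gray: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (in bits) of a grayscale image's intensity histogram."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

flat = np.full((64, 64), 128)   # single-colored background -> entropy 0
noisy = np.random.default_rng(0).integers(0, 256, (64, 64))  # busy random crop
print(shannon_entropy(flat), shannon_entropy(noisy))  # low vs. high entropy
```

A single-colored background collapses into one histogram bin (zero entropy), while a busy real-world crop spreads across many bins (entropy approaching 8 bits), so a simple threshold separates the two misdetection patterns.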

After this step I ran OCR extraction in Google Colab, so from here on the dataframe has an additional column `ocr`.